vulnerability_clusters

1 Introduction

The objective here is to present a typology of climate vulnerability, identifying factors and indicators that determine a village’s susceptibility to climate change.

1.1 Pipeline

A flowchart includes steps in performing PCA.

1.2 Loading the data

Reading the Data: We have an Excel file with lots of information. Each row in the file gives us details about a village, and the columns tell us about different factors that might make the village more or less vulnerable to climate change. There’s also a special sheet in the Excel file that explains what all these factors mean, their unique IDs, and how they’re measured.

1.2.1 The main dataset raw_df

Code
library(readxl)
library(dplyr)
library(corrplot)
library(caret)
library(gtExtras)
library(naniar)

df_raw <-
  read_xlsx("data/analisa_tipologi_sulsel.xlsx",
            sheet = "fin_copas") |> 
  janitor::clean_names() |> dplyr::select(-c(idkec_2, idkec_3)) |> 
  mutate(across(-c(idkec_dum, sumber, nmprov, nmkab, nmkec), as.numeric))

head(df_raw)
# A tibble: 6 × 59
  idkec_dum sumber kdprov kdkab kdkec nmprov           nmkab       nmkec periode
  <chr>     <chr>   <dbl> <dbl> <dbl> <chr>            <chr>       <chr>   <dbl>
1 7301010a  BPS        73     1    10 SULAWESI SELATAN KEPULAUAN … PASI…    2019
2 7301011a  BPS        73     1    11 SULAWESI SELATAN KEPULAUAN … PASI…    2019
3 7301020a  BPS        73     1    20 SULAWESI SELATAN KEPULAUAN … PASI…    2019
4 7301021a  BPS        73     1    21 SULAWESI SELATAN KEPULAUAN … TAKA…    2019
5 7301022a  BPS        73     1    22 SULAWESI SELATAN KEPULAUAN … PASI…    2019
6 7301030a  BPS        73     1    30 SULAWESI SELATAN KEPULAUAN … BONT…    2019
# ℹ 50 more variables: distance_to_plantation <dbl>, distance_to_road <dbl>,
#   distance_to_commodity_processing_factory <dbl>,
#   distance_to_plantation_concession <dbl>, distance_to_forest <dbl>,
#   distance_to_river <dbl>, distance_to_burned_area <dbl>,
#   percentage_of_agricultural_area_small_holder_in_the_village <dbl>,
#   percentage_of_plantation_area_per_sub_district <dbl>,
#   percentage_of_forested_area_in_the_sub_district <dbl>, …

Now we have a big table called df_raw. This table contains information like the name of the province, district, sub-district, and village. It also includes data about the village’s area, population density, and lots of other numerical factors.

Code
# Remove unnecessary columns from the original dataframe 'df_raw'
# The 'dplyr::select()' function is used to exclude these columns
df_gt <- df_raw |> 
  dplyr::select(-c(idkec_dum, kdprov, sumber, nmprov, nmkab, nmkec, periode, kdkab, kdkec, prec_change))

df_col_unstandardise <- df_raw |> dplyr::select(prec_change)

# Add a small constant (0.001) to every value in 'df_gt'
# This is often done to allow for log transformations of data that includes zeros
df_gt_plus <- df_gt + 0.001



# Log-transform and scale the data
# 1. 'as_tibble()' converts the dataframe to a tibble for easier manipulation
# 2. 'mutate_all(.funs = log10)' applies the base-10 logarithm to all columns
# 3. 'scale()' standardizes each column so that it has mean=0 and sd=1
scaled_df <- df_gt_plus |> as_tibble() |> mutate_all(.funs = log10) |>bind_cols(df_col_unstandardise) |>  scale(center = TRUE, scale = TRUE)

# Add back the columns that were removed earlier to create a complete, scaled dataframe
# 'bind_cols()' binds the selected columns from 'df_raw' and the scaled columns from 'scaled_df'
scaled_df_complete <- df_raw |> 
  dplyr::select(c(idkec_dum, kdprov, sumber, nmprov, nmkab, nmkec, periode, kdkab, kdkec)) |>  
  bind_cols(scaled_df)

# Remove the column 'korban_jiwa_kebakaran_hutan_dan_lahan_2018_2019' from the complete, scaled dataframe
scaled_df_complete_temp <- scaled_df_complete |> 
  dplyr::select(-korban_jiwa_kebakaran_hutan_dan_lahan_2018_2019)

2 Preparing the data

Before diving into the PCA, let’s ensure that the data meets the necessary requirements.

2.1 Check for missing values

Code
# Check for missing values
missing_data <- sapply(scaled_df_complete_temp, function(x) sum(is.na(x)))
print(missing_data)
                                                                  idkec_dum 
                                                                          0 
                                                                     kdprov 
                                                                          0 
                                                                     sumber 
                                                                          0 
                                                                     nmprov 
                                                                          0 
                                                                      nmkab 
                                                                          0 
                                                                      nmkec 
                                                                          0 
                                                                    periode 
                                                                          0 
                                                                      kdkab 
                                                                          0 
                                                                      kdkec 
                                                                          0 
                                                     distance_to_plantation 
                                                                          0 
                                                           distance_to_road 
                                                                          0 
                                   distance_to_commodity_processing_factory 
                                                                          0 
                                          distance_to_plantation_concession 
                                                                          0 
                                                         distance_to_forest 
                                                                          0 
                                                          distance_to_river 
                                                                          0 
                                                    distance_to_burned_area 
                                                                          0 
                percentage_of_agricultural_area_small_holder_in_the_village 
                                                                          0 
                             percentage_of_plantation_area_per_sub_district 
                                                                          0 
                            percentage_of_forested_area_in_the_sub_district 
                                                                          0 
                                percentage_of_shrubland_in_the_sub_district 
                                                                          0 
                     percentage_of_water_area_compared_to_sub_district_area 
                                                                          0 
                                                  distance_to_deforestation 
                                                                          0 
                                         percentage_deforestation_area_size 
                                                                          0 
                                                        arable_land_percent 
                                                                          0 
                                                   erosion_risk_t_ha_1_yr_1 
                                                                          0 
                                                       indeks_bahaya_banjir 
                                                                          0 
                                                      indeks_bahaya_longsor 
                                                                          0 
                                               buffer_to_500m_irigated_land 
                                                                          0 
                                                              aridity_index 
                                                                          0 
                     total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik 
                                                                          2 
                                                        rasio_elektrifikasi 
                                                                          2 
                                         rasio_sekolah_tinggi_sma_sederajat 
                                                                          2 
                                                                   rasio_pt 
                                                                          2 
                                                                   rasio_rs 
                                                                          2 
                                                             rasio_faskes_1 
                                                                          2 
                                                                rasio_pasar 
                                                                          2 
                                                  rasio_minimarket_swalayan 
                                                                          2 
                                    banyak_kejadian_tanah_longsor_2018_2019 
                                                                          2 
                                        korban_jiwa_tanah_longsor_2018_2019 
                                                                          2 
                                           banyak_kejadian_banjir_2018_2019 
                                                                          2 
                                               korban_jiwa_banjir_2018_2019 
                                                                          2 
                                   banyak_kejadian_banjir_bandang_2018_2019 
                                                                          2 
                                       korban_jiwa_banjir_bandang_2018_2019 
                                                                          2 
                        banyak_kejadian_kebakaran_hutan_dan_lahan_2018_2019 
                                                                          2 
                                 banyak_kejadian_kekeringan_lahan_2018_2019 
                                                                          2 
                                     korban_jiwa_kekeringan_lahan_2018_2019 
                                                                          2 
                                 jumlah_sistem_peringatan_dini_bencana_alam 
                                                                          2 
                             persentase_sistem_peringatan_dini_bencana_alam 
                                                                          2 
                                                  rasio_embung_di_kecamatan 
                                                                          2 
rasio_pasar_desa_pasar_hewan_pelelangan_ikan_pelelangan_hasil_pertanian_dll 
                                                                          2 
 jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 
                                                                          2 
  rasio_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 
                                                                          2 
                                                           annual_mean_temp 
                                                                          0 
                                                                temp_change 
                                                                          0 
                                                           annual_mean_prec 
                                                                          0 
                                                rasio_40pers_ekon_rendah_rt 
                                                                          2 
                                              rasio_40pers_ekon_rendah_indv 
                                                                          2 
                                                                prec_change 
                                                                          0 
Code
scaled_df_complete_temp %>% filter_all(any_vars(is.na(.)))
# A tibble: 2 × 58
  idkec_dum kdprov sumber nmprov           nmkab      nmkec  periode kdkab kdkec
  <chr>      <dbl> <chr>  <chr>            <chr>      <chr>    <dbl> <dbl> <dbl>
1 7313000a      73 BPS    SULAWESI SELATAN WAJO       WAJO      2019    13     0
2 7325000a      73 BPS    SULAWESI SELATAN LUWU TIMUR LUWU …    2019    25     0
# ℹ 49 more variables: distance_to_plantation <dbl>, distance_to_road <dbl>,
#   distance_to_commodity_processing_factory <dbl>,
#   distance_to_plantation_concession <dbl>, distance_to_forest <dbl>,
#   distance_to_river <dbl>, distance_to_burned_area <dbl>,
#   percentage_of_agricultural_area_small_holder_in_the_village <dbl>,
#   percentage_of_plantation_area_per_sub_district <dbl>,
#   percentage_of_forested_area_in_the_sub_district <dbl>, …
Code
#remove missing data
scaled_df_complete_temp <- scaled_df_complete_temp[complete.cases(scaled_df_complete_temp), ]

2.2 Exclude identifiers

Code
df_pre_pca <- scaled_df_complete_temp |> 
  select(-c(idkec_dum, sumber, kdprov, nmprov, nmkab, nmkec, periode, kdkab, kdkec))

glimpse(df_pre_pca)
Rows: 310
Columns: 49
$ distance_to_plantation                                                      <dbl> …
$ distance_to_road                                                            <dbl> …
$ distance_to_commodity_processing_factory                                    <dbl> …
$ distance_to_plantation_concession                                           <dbl> …
$ distance_to_forest                                                          <dbl> …
$ distance_to_river                                                           <dbl> …
$ distance_to_burned_area                                                     <dbl> …
$ percentage_of_agricultural_area_small_holder_in_the_village                 <dbl> …
$ percentage_of_plantation_area_per_sub_district                              <dbl> …
$ percentage_of_forested_area_in_the_sub_district                             <dbl> …
$ percentage_of_shrubland_in_the_sub_district                                 <dbl> …
$ percentage_of_water_area_compared_to_sub_district_area                      <dbl> …
$ distance_to_deforestation                                                   <dbl> …
$ percentage_deforestation_area_size                                          <dbl> …
$ arable_land_percent                                                         <dbl> …
$ erosion_risk_t_ha_1_yr_1                                                    <dbl> …
$ indeks_bahaya_banjir                                                        <dbl> …
$ indeks_bahaya_longsor                                                       <dbl> …
$ buffer_to_500m_irigated_land                                                <dbl> …
$ aridity_index                                                               <dbl> …
$ total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik                      <dbl> …
$ rasio_elektrifikasi                                                         <dbl> …
$ rasio_sekolah_tinggi_sma_sederajat                                          <dbl> …
$ rasio_pt                                                                    <dbl> …
$ rasio_rs                                                                    <dbl> …
$ rasio_faskes_1                                                              <dbl> …
$ rasio_pasar                                                                 <dbl> …
$ rasio_minimarket_swalayan                                                   <dbl> …
$ banyak_kejadian_tanah_longsor_2018_2019                                     <dbl> …
$ korban_jiwa_tanah_longsor_2018_2019                                         <dbl> …
$ banyak_kejadian_banjir_2018_2019                                            <dbl> …
$ korban_jiwa_banjir_2018_2019                                                <dbl> …
$ banyak_kejadian_banjir_bandang_2018_2019                                    <dbl> …
$ korban_jiwa_banjir_bandang_2018_2019                                        <dbl> …
$ banyak_kejadian_kebakaran_hutan_dan_lahan_2018_2019                         <dbl> …
$ banyak_kejadian_kekeringan_lahan_2018_2019                                  <dbl> …
$ korban_jiwa_kekeringan_lahan_2018_2019                                      <dbl> …
$ jumlah_sistem_peringatan_dini_bencana_alam                                  <dbl> …
$ persentase_sistem_peringatan_dini_bencana_alam                              <dbl> …
$ rasio_embung_di_kecamatan                                                   <dbl> …
$ rasio_pasar_desa_pasar_hewan_pelelangan_ikan_pelelangan_hasil_pertanian_dll <dbl> …
$ jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018  <dbl> …
$ rasio_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018   <dbl> …
$ annual_mean_temp                                                            <dbl> …
$ temp_change                                                                 <dbl> …
$ annual_mean_prec                                                            <dbl> …
$ rasio_40pers_ekon_rendah_rt                                                 <dbl> …
$ rasio_40pers_ekon_rendah_indv                                               <dbl> …
$ prec_change                                                                 <dbl> …

No missing data is found. Handle missing values if any, possibly through imputation

2.3 Scale the data

2.4 Identify and remove multi-collinear variables

Identifying and removing multicollinearity is an essential step before performing PCA, as PCA assumes that the features are linearly independent. Compute the correlation matrix and identify highly correlated variables:

Code
#df_pre_pca_abb <- df_pre_pca
# Abbreviate column names using R's built-in abbreviate function
#abbrev_colnames <- abbreviate(colnames(df_pre_pca_abb))

# Assign the abbreviated names back to the dataframe
#colnames(df_pre_pca_abb) <- abbrev_colnames

# Recalculate and plot the correlation matrix
cor_matrix <- cor(df_pre_pca) |> round(digits = 2)

# Convert the matrix to a data frame
cor_matrix_df <- as.data.frame(cor_matrix)

# Add row names as a new column
cor_matrix_df$Predictors <- rownames(cor_matrix)

# Move the 'RowNames' column to the first position
cor_matrix_df <- cor_matrix_df %>% select(Predictors, everything())



# Create the gt table
cor_matrix_df %>%
  gt() %>%
  data_color(
    columns = -Predictors,  # Exclude the 'RowNames' column from colorization
    colors = scales::col_numeric(
      palette = c("darkred", "white", "darkblue"),
      domain = c(-1, 1)
    )
  )
Warning: Since gt v0.9.0, the `colors` argument has been deprecated.
• Please use the `fn` argument instead.
This warning is displayed once every 8 hours.
Predictors distance_to_plantation distance_to_road distance_to_commodity_processing_factory distance_to_plantation_concession distance_to_forest distance_to_river distance_to_burned_area percentage_of_agricultural_area_small_holder_in_the_village percentage_of_plantation_area_per_sub_district percentage_of_forested_area_in_the_sub_district percentage_of_shrubland_in_the_sub_district percentage_of_water_area_compared_to_sub_district_area distance_to_deforestation percentage_deforestation_area_size arable_land_percent erosion_risk_t_ha_1_yr_1 indeks_bahaya_banjir indeks_bahaya_longsor buffer_to_500m_irigated_land aridity_index total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik rasio_elektrifikasi rasio_sekolah_tinggi_sma_sederajat rasio_pt rasio_rs rasio_faskes_1 rasio_pasar rasio_minimarket_swalayan banyak_kejadian_tanah_longsor_2018_2019 korban_jiwa_tanah_longsor_2018_2019 banyak_kejadian_banjir_2018_2019 korban_jiwa_banjir_2018_2019 banyak_kejadian_banjir_bandang_2018_2019 korban_jiwa_banjir_bandang_2018_2019 banyak_kejadian_kebakaran_hutan_dan_lahan_2018_2019 banyak_kejadian_kekeringan_lahan_2018_2019 korban_jiwa_kekeringan_lahan_2018_2019 jumlah_sistem_peringatan_dini_bencana_alam persentase_sistem_peringatan_dini_bencana_alam rasio_embung_di_kecamatan rasio_pasar_desa_pasar_hewan_pelelangan_ikan_pelelangan_hasil_pertanian_dll jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 rasio_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 annual_mean_temp temp_change annual_mean_prec rasio_40pers_ekon_rendah_rt rasio_40pers_ekon_rendah_indv prec_change
distance_to_plantation 1.00 0.06 0.17 -0.42 -0.01 0.33 0.02 -0.22 -0.04 0.02 0.08 0.08 0.04 0.04 -0.18 -0.25 -0.14 -0.10 -0.17 -0.30 0.00 -0.04 0.06 0.04 0.02 -0.06 0.15 -0.19 -0.06 0.07 -0.04 0.04 0.10 0.13 -0.06 -0.06 -0.02 -0.11 0.07 0.10 -0.12 0.09 0.08 -0.12 -0.14 -0.14 0.11 0.08 0.07
distance_to_road 0.06 1.00 0.21 -0.01 -0.45 0.30 0.17 -0.12 -0.56 0.66 0.06 -0.13 -0.31 0.55 0.08 -0.11 -0.42 0.47 -0.21 0.05 -0.45 -0.29 -0.10 -0.33 -0.36 -0.22 0.15 -0.32 0.24 0.02 -0.04 -0.04 -0.06 -0.03 0.06 0.08 0.00 -0.07 -0.48 0.07 0.07 0.02 -0.01 -0.11 -0.06 -0.06 0.31 0.28 0.13
distance_to_commodity_processing_factory 0.17 0.21 1.00 -0.30 0.14 0.37 0.22 0.11 -0.22 0.04 0.16 -0.16 0.18 0.04 0.28 -0.19 -0.02 -0.05 -0.17 -0.60 0.00 -0.03 -0.12 -0.15 -0.17 -0.06 0.14 0.00 -0.10 -0.01 0.07 0.00 -0.02 -0.01 0.12 0.08 -0.09 0.14 -0.09 0.34 0.20 -0.01 0.01 0.05 0.02 -0.04 0.05 -0.07 -0.46
distance_to_plantation_concession -0.42 -0.01 -0.30 1.00 -0.05 -0.54 -0.45 0.14 0.04 0.03 0.07 -0.01 -0.03 -0.01 -0.08 0.17 0.02 0.05 0.05 0.59 0.02 -0.05 0.01 0.05 0.02 0.06 -0.26 0.11 0.07 -0.04 0.02 -0.03 -0.13 -0.15 0.04 0.01 0.03 0.09 -0.04 0.03 0.00 0.03 0.02 -0.06 -0.01 0.01 -0.06 0.01 0.09
distance_to_forest -0.01 -0.45 0.14 -0.05 1.00 -0.04 -0.04 0.34 0.31 -0.80 0.18 0.05 0.89 -0.72 0.24 -0.26 0.30 -0.52 0.20 -0.46 0.33 0.29 -0.08 0.04 0.02 0.22 -0.06 0.19 -0.31 -0.08 0.03 0.04 0.02 -0.05 -0.03 -0.03 -0.02 0.23 0.23 -0.01 0.07 -0.01 0.02 -0.08 -0.12 -0.16 -0.10 -0.13 -0.18
distance_to_river 0.33 0.30 0.37 -0.54 -0.04 1.00 0.41 -0.28 -0.22 0.15 0.00 0.02 0.02 0.14 -0.07 -0.27 -0.41 0.15 -0.33 -0.38 -0.22 -0.11 -0.04 -0.27 -0.22 -0.21 0.17 -0.30 0.05 0.05 -0.25 -0.13 -0.11 -0.07 0.05 -0.03 0.01 -0.12 -0.17 0.07 0.03 -0.04 -0.04 -0.21 -0.23 -0.23 0.17 0.10 -0.10
distance_to_burned_area 0.02 0.17 0.22 -0.45 -0.04 0.41 1.00 -0.20 0.00 0.08 -0.04 -0.03 -0.08 0.20 -0.02 -0.13 -0.03 -0.02 -0.17 -0.27 -0.18 -0.02 0.02 -0.11 -0.04 -0.14 0.20 -0.15 -0.04 -0.11 -0.10 -0.11 -0.16 -0.04 -0.03 -0.02 0.01 -0.02 -0.19 -0.20 0.13 0.04 0.03 -0.10 -0.11 -0.12 0.00 -0.01 -0.19
percentage_of_agricultural_area_small_holder_in_the_village -0.22 -0.12 0.11 0.14 0.34 -0.28 -0.20 1.00 -0.16 -0.21 0.07 -0.09 0.33 -0.25 0.78 0.29 0.30 -0.04 0.25 0.02 0.14 0.10 -0.15 -0.09 -0.14 0.13 -0.12 0.22 -0.01 -0.01 0.23 0.12 0.12 0.05 -0.04 0.10 0.03 0.17 0.05 0.23 0.20 -0.02 -0.01 0.43 0.41 0.40 0.05 0.01 -0.20
percentage_of_plantation_area_per_sub_district -0.04 -0.56 -0.22 0.04 0.31 -0.22 0.00 -0.16 1.00 -0.40 0.04 0.09 0.21 -0.26 -0.30 -0.23 0.46 -0.57 0.34 -0.18 0.58 0.28 0.16 0.47 0.46 0.28 -0.18 0.36 -0.34 -0.02 0.10 -0.02 0.04 0.03 0.00 -0.02 0.04 0.03 0.66 -0.32 -0.12 0.02 0.02 -0.17 -0.21 -0.22 -0.48 -0.45 0.04
percentage_of_forested_area_in_the_sub_district 0.02 0.66 0.04 0.03 -0.80 0.15 0.08 -0.21 -0.40 1.00 -0.09 -0.09 -0.74 0.81 -0.07 0.18 -0.33 0.57 -0.17 0.33 -0.29 -0.34 -0.02 -0.20 -0.21 -0.23 0.13 -0.18 0.31 0.10 -0.01 -0.05 -0.01 0.02 0.12 0.13 0.04 -0.06 -0.30 0.10 0.04 0.06 0.04 0.05 0.08 0.11 0.27 0.25 0.07
percentage_of_shrubland_in_the_sub_district 0.08 0.06 0.16 0.07 0.18 0.00 -0.04 0.07 0.04 -0.09 1.00 0.01 0.20 -0.03 0.01 -0.11 0.17 -0.22 0.05 -0.14 0.12 0.02 -0.08 0.01 0.03 0.04 -0.01 0.11 -0.16 -0.03 0.17 -0.05 -0.07 -0.03 0.04 0.02 -0.01 0.12 0.06 0.05 0.12 0.01 0.02 0.04 0.04 0.01 -0.19 -0.21 -0.09
percentage_of_water_area_compared_to_sub_district_area 0.08 -0.13 -0.16 -0.01 0.05 0.02 -0.03 -0.09 0.09 -0.09 0.01 1.00 0.01 -0.06 -0.17 -0.01 0.06 -0.05 0.04 0.07 0.08 0.06 0.13 0.01 0.23 0.07 -0.02 0.03 -0.07 -0.02 0.02 -0.03 -0.05 -0.02 -0.01 -0.03 -0.01 -0.04 0.12 -0.07 -0.06 0.06 0.05 -0.17 -0.17 -0.16 -0.12 -0.10 0.07
distance_to_deforestation 0.04 -0.31 0.18 -0.03 0.89 0.02 -0.08 0.33 0.21 -0.74 0.20 0.01 1.00 -0.79 0.22 -0.29 0.20 -0.43 0.13 -0.46 0.21 0.24 -0.12 -0.04 -0.05 0.18 -0.09 0.08 -0.27 -0.10 0.01 0.02 0.04 -0.07 -0.04 -0.04 0.02 0.18 0.12 0.00 0.06 -0.07 -0.05 -0.11 -0.14 -0.18 -0.01 -0.06 -0.05
percentage_deforestation_area_size 0.04 0.55 0.04 -0.01 -0.72 0.14 0.20 -0.25 -0.26 0.81 -0.03 -0.06 -0.79 1.00 -0.13 0.10 -0.23 0.35 -0.09 0.25 -0.21 -0.26 0.04 -0.12 -0.10 -0.20 0.13 -0.14 0.21 0.09 0.03 -0.01 -0.06 0.03 0.14 0.08 -0.05 -0.06 -0.21 0.02 0.05 0.10 0.07 0.01 0.03 0.06 0.11 0.12 0.03
arable_land_percent -0.18 0.08 0.28 -0.08 0.24 -0.07 -0.02 0.78 -0.30 -0.07 0.01 -0.17 0.22 -0.13 1.00 0.35 0.07 0.20 0.11 -0.07 -0.04 0.05 -0.17 -0.22 -0.26 0.06 0.04 0.08 0.07 0.03 0.13 0.07 0.10 0.03 0.04 0.04 0.01 0.07 -0.09 0.25 0.23 -0.08 -0.07 0.51 0.50 0.48 0.17 0.11 -0.27
erosion_risk_t_ha_1_yr_1 -0.25 -0.11 -0.19 0.17 -0.26 -0.27 -0.13 0.29 -0.23 0.18 -0.11 -0.01 -0.29 0.10 0.35 1.00 -0.05 0.44 -0.07 0.58 -0.11 -0.13 -0.05 -0.05 -0.02 -0.04 -0.02 0.04 0.30 0.05 -0.02 -0.08 0.03 -0.03 0.06 0.05 0.02 0.05 -0.16 0.12 0.07 -0.02 -0.02 0.66 0.69 0.71 0.06 0.07 0.04
indeks_bahaya_banjir -0.14 -0.42 -0.02 0.02 0.30 -0.41 -0.03 0.30 0.46 -0.33 0.17 0.06 0.20 -0.23 0.07 -0.05 1.00 -0.61 0.56 -0.18 0.54 0.29 0.01 0.31 0.26 0.33 -0.07 0.56 -0.40 -0.11 0.50 0.18 0.06 0.04 -0.06 0.05 -0.06 0.22 0.48 -0.15 0.09 0.12 0.14 0.26 0.20 0.18 -0.42 -0.42 -0.23
indeks_bahaya_longsor -0.10 0.47 -0.05 0.05 -0.52 0.15 -0.02 -0.04 -0.57 0.57 -0.22 -0.05 -0.43 0.35 0.20 0.44 -0.61 1.00 -0.48 0.43 -0.48 -0.25 -0.01 -0.33 -0.29 -0.21 0.08 -0.29 0.50 0.10 -0.17 -0.13 -0.04 -0.03 0.06 0.04 0.05 -0.13 -0.47 0.29 0.00 -0.07 -0.07 0.14 0.20 0.23 0.35 0.35 0.13
buffer_to_500m_irigated_land -0.17 -0.21 -0.17 0.05 0.20 -0.33 -0.17 0.25 0.34 -0.17 0.05 0.04 0.13 -0.09 0.11 -0.07 0.56 -0.48 1.00 -0.07 0.37 0.18 -0.07 0.18 0.12 0.21 -0.08 0.31 -0.32 -0.06 0.22 0.09 0.16 0.00 -0.01 0.11 -0.04 0.11 0.35 -0.23 0.12 0.09 0.08 0.12 0.10 0.08 -0.23 -0.22 0.07
aridity_index -0.30 0.05 -0.60 0.59 -0.46 -0.38 -0.27 0.02 -0.18 0.33 -0.14 0.07 -0.46 0.25 -0.07 0.58 -0.18 0.43 -0.07 1.00 -0.20 -0.17 0.07 -0.01 0.02 -0.05 -0.12 -0.07 0.32 0.05 -0.06 -0.04 -0.14 -0.12 0.01 -0.03 0.06 -0.04 -0.21 -0.05 -0.03 0.10 0.08 0.21 0.28 0.35 0.06 0.14 0.25
total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik 0.00 -0.45 0.00 0.02 0.33 -0.22 -0.18 0.14 0.58 -0.29 0.12 0.08 0.21 -0.21 -0.04 -0.11 0.54 -0.48 0.37 -0.20 1.00 0.26 0.01 0.44 0.44 0.41 -0.12 0.58 -0.22 0.00 0.30 0.09 0.12 0.07 0.08 0.04 -0.12 0.52 0.81 0.03 0.07 0.13 0.16 0.12 0.07 0.06 -0.49 -0.53 -0.32
rasio_elektrifikasi -0.04 -0.29 -0.03 -0.05 0.29 -0.11 -0.02 0.10 0.28 -0.34 0.02 0.06 0.24 -0.26 0.05 -0.13 0.29 -0.25 0.18 -0.17 0.26 1.00 0.14 0.20 0.19 0.22 -0.02 0.30 -0.16 0.02 0.05 0.05 0.11 0.05 0.05 0.01 0.03 0.00 0.30 -0.10 -0.04 0.01 0.03 -0.01 -0.04 -0.05 -0.31 -0.31 -0.05
rasio_sekolah_tinggi_sma_sederajat 0.06 -0.10 -0.12 0.01 -0.08 -0.04 0.02 -0.15 0.16 -0.02 -0.08 0.13 -0.12 0.04 -0.17 -0.05 0.01 -0.01 -0.07 0.07 0.01 0.14 1.00 0.29 0.31 0.00 -0.05 0.05 0.06 0.12 0.04 0.01 0.03 0.14 0.09 -0.05 -0.12 -0.07 0.06 0.02 -0.12 0.00 -0.01 -0.09 -0.09 -0.08 -0.19 -0.16 0.01
rasio_pt 0.04 -0.33 -0.15 0.05 0.04 -0.27 -0.11 -0.09 0.47 -0.20 0.01 0.01 -0.04 -0.12 -0.22 -0.05 0.31 -0.33 0.18 -0.01 0.44 0.20 0.29 1.00 0.65 0.23 -0.10 0.33 -0.11 0.00 0.09 0.03 -0.02 0.05 0.08 0.02 -0.03 0.06 0.48 -0.16 -0.14 0.03 0.04 0.09 0.07 0.07 -0.58 -0.53 -0.02
rasio_rs 0.02 -0.36 -0.17 0.02 0.02 -0.22 -0.04 -0.14 0.46 -0.21 0.03 0.23 -0.05 -0.10 -0.26 -0.02 0.26 -0.29 0.12 0.02 0.44 0.19 0.31 0.65 1.00 0.19 -0.10 0.25 -0.10 -0.05 0.05 -0.06 0.00 0.02 0.05 0.00 -0.03 0.05 0.47 -0.20 -0.18 -0.06 -0.03 0.01 0.00 0.00 -0.63 -0.57 0.03
rasio_faskes_1 -0.06 -0.22 -0.06 0.06 0.22 -0.21 -0.14 0.13 0.28 -0.23 0.04 0.07 0.18 -0.20 0.06 -0.04 0.33 -0.21 0.21 -0.05 0.41 0.22 0.00 0.23 0.19 1.00 -0.09 0.26 -0.17 -0.05 0.20 0.12 0.06 0.06 0.01 0.07 -0.02 0.14 0.38 -0.01 0.01 0.09 0.10 0.10 0.08 0.07 -0.26 -0.26 -0.12
rasio_pasar 0.15 0.15 0.14 -0.26 -0.06 0.17 0.20 -0.12 -0.18 0.13 -0.01 -0.02 -0.09 0.13 0.04 -0.02 -0.07 0.08 -0.08 -0.12 -0.12 -0.02 -0.05 -0.10 -0.10 -0.09 1.00 -0.14 0.01 0.00 -0.06 -0.01 -0.03 -0.02 -0.03 -0.02 -0.03 0.02 -0.15 -0.03 0.13 -0.03 -0.03 0.06 0.05 0.05 0.06 0.00 -0.05
rasio_minimarket_swalayan -0.19 -0.32 0.00 0.11 0.19 -0.30 -0.15 0.22 0.36 -0.18 0.11 0.03 0.08 -0.14 0.08 0.04 0.56 -0.29 0.31 -0.07 0.58 0.30 0.05 0.33 0.25 0.26 -0.14 1.00 -0.15 -0.11 0.37 0.12 0.07 0.05 0.08 0.03 -0.08 0.33 0.44 0.04 0.07 0.16 0.18 0.17 0.13 0.12 -0.40 -0.41 -0.21
banyak_kejadian_tanah_longsor_2018_2019 -0.06 0.24 -0.10 0.07 -0.31 0.05 -0.04 -0.01 -0.34 0.31 -0.16 -0.07 -0.27 0.21 0.07 0.30 -0.40 0.50 -0.32 0.32 -0.22 -0.16 0.06 -0.11 -0.10 -0.17 0.01 -0.15 1.00 0.17 -0.05 -0.12 0.02 0.02 0.21 0.14 0.06 -0.01 -0.25 0.18 0.03 0.00 -0.01 0.06 0.10 0.12 0.21 0.21 0.13
korban_jiwa_tanah_longsor_2018_2019 0.07 0.02 -0.01 -0.04 -0.08 0.05 -0.11 -0.01 -0.02 0.10 -0.03 -0.02 -0.10 0.09 0.03 0.05 -0.11 0.10 -0.06 0.05 0.00 0.02 0.12 0.00 -0.05 -0.05 0.00 -0.11 0.17 1.00 -0.02 -0.02 0.19 0.20 0.06 -0.04 -0.01 -0.09 0.06 0.07 0.02 -0.02 -0.05 0.01 0.00 0.02 0.06 0.03 -0.04
banyak_kejadian_banjir_2018_2019 -0.04 -0.04 0.07 0.02 0.03 -0.25 -0.10 0.23 0.10 -0.01 0.17 0.02 0.01 0.03 0.13 -0.02 0.50 -0.17 0.22 -0.06 0.30 0.05 0.04 0.09 0.05 0.20 -0.06 0.37 -0.05 -0.02 1.00 0.23 0.09 0.04 0.03 0.16 -0.06 0.19 0.22 0.13 0.20 0.18 0.18 0.17 0.13 0.13 -0.13 -0.16 -0.19
korban_jiwa_banjir_2018_2019 0.04 -0.04 0.00 -0.03 0.04 -0.13 -0.11 0.12 -0.02 -0.05 -0.05 -0.03 0.02 -0.01 0.07 -0.08 0.18 -0.13 0.09 -0.04 0.09 0.05 0.01 0.03 -0.06 0.12 -0.01 0.12 -0.12 -0.02 0.23 1.00 0.14 0.17 -0.02 0.02 -0.01 0.05 0.07 0.05 -0.05 0.15 0.13 0.04 0.02 0.02 0.01 0.01 -0.12
banyak_kejadian_banjir_bandang_2018_2019 0.10 -0.06 -0.02 -0.13 0.02 -0.11 -0.16 0.12 0.04 -0.01 -0.07 -0.05 0.04 -0.06 0.10 0.03 0.06 -0.04 0.16 -0.14 0.12 0.11 0.03 -0.02 0.00 0.06 -0.03 0.07 0.02 0.19 0.09 0.14 1.00 0.46 0.04 0.07 -0.02 -0.04 0.17 0.13 0.05 0.01 0.03 0.05 0.03 0.02 0.07 0.04 0.09
korban_jiwa_banjir_bandang_2018_2019 0.13 -0.03 -0.01 -0.15 -0.05 -0.07 -0.04 0.05 0.03 0.02 -0.03 -0.02 -0.07 0.03 0.03 -0.03 0.04 -0.03 0.00 -0.12 0.07 0.05 0.14 0.05 0.02 0.06 -0.02 0.05 0.02 0.20 0.04 0.17 0.46 1.00 -0.04 0.10 -0.01 -0.06 0.13 0.12 -0.02 0.01 -0.01 0.02 0.01 0.01 0.03 0.02 0.01
banyak_kejadian_kebakaran_hutan_dan_lahan_2018_2019 -0.06 0.06 0.12 0.04 -0.03 0.05 -0.03 -0.04 0.00 0.12 0.04 -0.01 -0.04 0.14 0.04 0.06 -0.06 0.06 -0.01 0.01 0.08 0.05 0.09 0.08 0.05 0.01 -0.03 0.08 0.21 0.06 0.03 -0.02 0.04 -0.04 1.00 0.15 -0.02 0.01 0.09 0.03 0.08 0.00 -0.03 0.03 0.03 0.03 -0.07 -0.08 0.02
banyak_kejadian_kekeringan_lahan_2018_2019 -0.06 0.08 0.08 0.01 -0.03 -0.03 -0.02 0.10 -0.02 0.13 0.02 -0.03 -0.04 0.08 0.04 0.05 0.05 0.04 0.11 -0.03 0.04 0.01 -0.05 0.02 0.00 0.07 -0.02 0.03 0.14 -0.04 0.16 0.02 0.07 0.10 0.15 1.00 0.18 0.03 0.03 0.14 0.03 0.05 0.06 0.05 0.05 0.04 0.04 0.02 0.01
korban_jiwa_kekeringan_lahan_2018_2019 -0.02 0.00 -0.09 0.03 -0.02 0.01 0.01 0.03 0.04 0.04 -0.01 -0.01 0.02 -0.05 0.01 0.02 -0.06 0.05 -0.04 0.06 -0.12 0.03 -0.12 -0.03 -0.03 -0.02 -0.03 -0.08 0.06 -0.01 -0.06 -0.01 -0.02 -0.01 -0.02 0.18 1.00 -0.11 -0.06 -0.07 -0.04 -0.05 -0.05 0.00 0.02 0.01 0.02 0.02 0.10
jumlah_sistem_peringatan_dini_bencana_alam -0.11 -0.07 0.14 0.09 0.23 -0.12 -0.02 0.17 0.03 -0.06 0.12 -0.04 0.18 -0.06 0.07 0.05 0.22 -0.13 0.11 -0.04 0.52 0.00 -0.07 0.06 0.05 0.14 0.02 0.33 -0.01 -0.09 0.19 0.05 -0.04 -0.06 0.01 0.03 -0.11 1.00 -0.08 0.22 0.25 0.16 0.19 0.15 0.14 0.13 0.02 -0.02 -0.30
persentase_sistem_peringatan_dini_bencana_alam 0.07 -0.48 -0.09 -0.04 0.23 -0.17 -0.19 0.05 0.66 -0.30 0.06 0.12 0.12 -0.21 -0.09 -0.16 0.48 -0.47 0.35 -0.21 0.81 0.30 0.06 0.48 0.47 0.38 -0.15 0.44 -0.25 0.06 0.22 0.07 0.17 0.13 0.09 0.03 -0.06 -0.08 1.00 -0.12 -0.09 0.03 0.05 0.04 -0.02 -0.02 -0.59 -0.60 -0.16
rasio_embung_di_kecamatan 0.10 0.07 0.34 0.03 -0.01 0.07 -0.20 0.23 -0.32 0.10 0.05 -0.07 0.00 0.02 0.25 0.12 -0.15 0.29 -0.23 -0.05 0.03 -0.10 0.02 -0.16 -0.20 -0.01 -0.03 0.04 0.18 0.07 0.13 0.05 0.13 0.12 0.03 0.14 -0.07 0.22 -0.12 1.00 0.16 0.08 0.08 0.10 0.10 0.08 0.24 0.16 -0.31
rasio_pasar_desa_pasar_hewan_pelelangan_ikan_pelelangan_hasil_pertanian_dll -0.12 0.07 0.20 0.00 0.07 0.03 0.13 0.20 -0.12 0.04 0.12 -0.06 0.06 0.05 0.23 0.07 0.09 0.00 0.12 -0.03 0.07 -0.04 -0.12 -0.14 -0.18 0.01 0.13 0.07 0.03 0.02 0.20 -0.05 0.05 -0.02 0.08 0.03 -0.04 0.25 -0.09 0.16 1.00 0.07 0.06 0.11 0.09 0.09 0.09 0.02 -0.26
jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 0.09 0.02 -0.01 0.03 -0.01 -0.04 0.04 -0.02 0.02 0.06 0.01 0.06 -0.07 0.10 -0.08 -0.02 0.12 -0.07 0.09 0.10 0.13 0.01 0.00 0.03 -0.06 0.09 -0.03 0.16 0.00 -0.02 0.18 0.15 0.01 0.01 0.00 0.05 -0.05 0.16 0.03 0.08 0.07 1.00 0.97 -0.04 -0.05 -0.04 0.02 0.01 -0.11
rasio_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 0.08 -0.01 0.01 0.02 0.02 -0.04 0.03 -0.01 0.02 0.04 0.02 0.05 -0.05 0.07 -0.07 -0.02 0.14 -0.07 0.08 0.08 0.16 0.03 -0.01 0.04 -0.03 0.10 -0.03 0.18 -0.01 -0.05 0.18 0.13 0.03 -0.01 -0.03 0.06 -0.05 0.19 0.05 0.08 0.06 0.97 1.00 -0.02 -0.03 -0.02 0.01 -0.01 -0.14
annual_mean_temp -0.12 -0.11 0.05 -0.06 -0.08 -0.21 -0.10 0.43 -0.17 0.05 0.04 -0.17 -0.11 0.01 0.51 0.66 0.26 0.14 0.12 0.21 0.12 -0.01 -0.09 0.09 0.01 0.10 0.06 0.17 0.06 0.01 0.17 0.04 0.05 0.02 0.03 0.05 0.00 0.15 0.04 0.10 0.11 -0.04 -0.02 1.00 0.99 0.99 -0.14 -0.17 -0.18
temp_change -0.14 -0.06 0.02 -0.01 -0.12 -0.23 -0.11 0.41 -0.21 0.08 0.04 -0.17 -0.14 0.03 0.50 0.69 0.20 0.20 0.10 0.28 0.07 -0.04 -0.09 0.07 0.00 0.08 0.05 0.13 0.10 0.00 0.13 0.02 0.03 0.01 0.03 0.05 0.02 0.14 -0.02 0.10 0.09 -0.05 -0.03 0.99 1.00 0.99 -0.11 -0.13 -0.13
annual_mean_prec -0.14 -0.06 -0.04 0.01 -0.16 -0.23 -0.12 0.40 -0.22 0.11 0.01 -0.16 -0.18 0.06 0.48 0.71 0.18 0.23 0.08 0.35 0.06 -0.05 -0.08 0.07 0.00 0.07 0.05 0.12 0.12 0.02 0.13 0.02 0.02 0.01 0.03 0.04 0.01 0.13 -0.02 0.08 0.09 -0.04 -0.02 0.99 0.99 1.00 -0.10 -0.12 -0.11
rasio_40pers_ekon_rendah_rt 0.11 0.31 0.05 -0.06 -0.10 0.17 0.00 0.05 -0.48 0.27 -0.19 -0.12 -0.01 0.11 0.17 0.06 -0.42 0.35 -0.23 0.06 -0.49 -0.31 -0.19 -0.58 -0.63 -0.26 0.06 -0.40 0.21 0.06 -0.13 0.01 0.07 0.03 -0.07 0.04 0.02 0.02 -0.59 0.24 0.09 0.02 0.01 -0.14 -0.11 -0.10 1.00 0.98 0.11
rasio_40pers_ekon_rendah_indv 0.08 0.28 -0.07 0.01 -0.13 0.10 -0.01 0.01 -0.45 0.25 -0.21 -0.10 -0.06 0.12 0.11 0.07 -0.42 0.35 -0.22 0.14 -0.53 -0.31 -0.16 -0.53 -0.57 -0.26 0.00 -0.41 0.21 0.03 -0.16 0.01 0.04 0.02 -0.08 0.02 0.02 -0.02 -0.60 0.16 0.02 0.01 -0.01 -0.17 -0.13 -0.12 0.98 1.00 0.20
prec_change 0.07 0.13 -0.46 0.09 -0.18 -0.10 -0.19 -0.20 0.04 0.07 -0.09 0.07 -0.05 0.03 -0.27 0.04 -0.23 0.13 0.07 0.25 -0.32 -0.05 0.01 -0.02 0.03 -0.12 -0.05 -0.21 0.13 -0.04 -0.19 -0.12 0.09 0.01 0.02 0.01 0.10 -0.30 -0.16 -0.31 -0.26 -0.11 -0.14 -0.18 -0.13 -0.11 0.11 0.20 1.00
Code
#corrplot(cor_matrix, method="circle", tl.cex = 0.5, type = "lower")

Systematically Remove Multicollinear Variables: decide on a correlation threshold that indicates multicollinearity (e.g., 0.8) and then manually or programmatically remove one variable from each pair or group that exceeds this threshold. The removed variables are:

Code
# Set a threshold value for correlation
threshold <- 0.8

# Find the indices of the variables that are highly correlated according to the specified threshold in the correlation matrix 'cor_matrix.'
# This will be used to remove the highly correlated variables from 'df_pre_pca.'
highly_correlated <- findCorrelation(cor(df_pre_pca), cutoff = threshold, names = FALSE)

# If you also need the names of the highly correlated variables, you can extract them using the indices.
highly_correlated_names <- colnames(cor(df_pre_pca))[highly_correlated]


# # Remove the columns from 'df_pre_pca' that correspond to the indices of the highly correlated variables.
# df_pre_pca <- df_pre_pca |> dplyr::select(-highly_correlated_names[[1]])
# Create a data frame from highly_correlated_names and threshold
highly_correlated_df <- data.frame(
  Predictor = highly_correlated_names
 # Threshold = threshold
)

# Create a gt table
highly_correlated_df %>%
  gt() %>%
  tab_header(
    title = "Highly Correlated Predictors",
    subtitle = paste("Predictors with correlation above", threshold)
  ) %>%
  cols_label(
    Predictor = "Predictor Name"
   # Threshold = "Correlation Threshold"
  )
Highly Correlated Predictors
Predictors with correlation above 0.8
Predictor Name
total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik
percentage_of_forested_area_in_the_sub_district
rasio_40pers_ekon_rendah_rt
distance_to_forest
annual_mean_temp
annual_mean_prec
jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018

3 Exploratory data analysis (EDA)

Visualize the data and examine any patterns, distributions, or outliers.

Code
# Pivot the 'scaled_df_complete_temp' dataframe to long format
# Exclude certain columns from being pivoted and specify the new column names for 'variable' and 'value'
df_long <- scaled_df_complete_temp |> 
  dplyr::select(1:9) |> bind_cols(df_pre_pca) |> 
  tidyr::pivot_longer(
    cols = -c(idkec_dum, sumber, kdprov, nmprov, nmkab, nmkec, periode, kdkab, kdkec),
    names_to = "variable",
    values_to = "value"
  )

# Select only the 'variable' and 'value' columns from the long-form data
# Remove unwanted columns and group by 'variable' to prepare for summarization
gt_tab <- df_long |>
  dplyr::select(-c(idkec_dum, sumber, kdprov, nmprov, nmkab, nmkec, periode, kdkab, kdkec)) |>
  group_by(variable) 

# Calculate summary statistics for each group (each 'variable')
# Also, keep the original 'value' data in a list column called 'value_data'
# Use '.groups = "drop"' to return a regular dataframe rather than a grouped one
gt_stat <- gt_tab |> 
  summarise(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    #mean = mean(value, na.rm = TRUE),
    #sd = sd(value, na.rm = TRUE),
    value_data = list(value),
    .groups = "drop"
  )

# Create a GT table with a density plot column
# The density plot is generated from the 'value_data' list column
# Customize the appearance of the density plot and format the number columns
gt_stat |>
  gt::gt() |> 
  gtExtras::gt_plt_dist(
    value_data,
    type = "density",
    line_color = "blue",
    fill_color = "red"
  ) |> gt::fmt_number(columns = min:median, decimals = 1)
variable min max median value_data
annual_mean_prec −7.8 0.3 0.1
annual_mean_temp −7.8 0.2 0.2
arable_land_percent −4.5 0.6 0.3
aridity_index −4.5 1.5 0.3
banyak_kejadian_banjir_2018_2019 −1.1 1.5 0.5
banyak_kejadian_banjir_bandang_2018_2019 −0.3 4.0 −0.3
banyak_kejadian_kebakaran_hutan_dan_lahan_2018_2019 −0.3 4.1 −0.3
banyak_kejadian_kekeringan_lahan_2018_2019 −0.4 3.3 −0.4
banyak_kejadian_tanah_longsor_2018_2019 −0.8 1.8 −0.8
buffer_to_500m_irigated_land −0.7 2.1 −0.7
distance_to_burned_area −2.5 3.6 −0.1
distance_to_commodity_processing_factory −3.4 2.2 0.2
distance_to_deforestation −2.4 2.2 0.0
distance_to_forest −4.4 2.0 0.1
distance_to_plantation −3.1 2.9 0.1
distance_to_plantation_concession −4.4 0.6 0.3
distance_to_river −1.9 4.6 −0.1
distance_to_road −3.7 2.9 −0.1
erosion_risk_t_ha_1_yr_1 −8.3 1.1 0.2
indeks_bahaya_banjir −2.0 1.2 0.3
indeks_bahaya_longsor −1.5 1.1 0.4
jumlah_sistem_peringatan_dini_bencana_alam −2.7 2.8 0.0
jumlah_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 −0.9 1.8 −0.9
korban_jiwa_banjir_2018_2019 −0.2 5.4 −0.2
korban_jiwa_banjir_bandang_2018_2019 −0.1 8.7 −0.1
korban_jiwa_kekeringan_lahan_2018_2019 −0.1 17.6 −0.1
korban_jiwa_tanah_longsor_2018_2019 −0.1 10.1 −0.1
percentage_deforestation_area_size −0.9 2.6 −0.3
percentage_of_agricultural_area_small_holder_in_the_village −3.2 1.0 0.3
percentage_of_forested_area_in_the_sub_district −1.1 1.6 0.0
percentage_of_plantation_area_per_sub_district −1.8 2.2 0.0
percentage_of_shrubland_in_the_sub_district −0.2 6.6 −0.2
percentage_of_water_area_compared_to_sub_district_area −0.2 11.6 −0.2
persentase_sistem_peringatan_dini_bencana_alam −2.7 3.2 0.0
prec_change −2.8 2.0 0.1
rasio_40pers_ekon_rendah_indv −4.3 2.4 0.1
rasio_40pers_ekon_rendah_rt −4.1 2.3 0.1
rasio_elektrifikasi −8.7 0.5 0.4
rasio_embung_di_kecamatan −1.2 1.4 0.6
rasio_faskes_1 −13.8 2.2 0.1
rasio_minimarket_swalayan −1.4 0.9 0.6
rasio_pasar −0.5 16.8 −0.1
rasio_pasar_desa_pasar_hewan_pelelangan_ikan_pelelangan_hasil_pertanian_dll −0.8 1.6 −0.8
rasio_pt −0.5 2.2 −0.5
rasio_rs −0.5 2.4 −0.5
rasio_sekolah_tinggi_sma_sederajat −2.0 2.8 −0.1
rasio_warga_penderita_gizi_buruk_marasmus_dan_kwashiorkor_pada_tahun_2018 −0.9 1.4 −0.9
temp_change −7.8 0.4 0.1
total_kk_berdasarkan_pengguna_dan_non_pengguna_listrik −2.9 2.9 0.1
Code
## Boxplot
# Histograms

# boxplot(df_pre_pca)

3.1 Density plots

Code
# Start the PNG device
png("output/density_plots.png", width = 2000, height = 2800)

# Set up the plot layout
par(mfrow = c(7, 7))

# Define the font size
font_size <- 4

# Loop through the columns of the data frame
for (i in 1:ncol(df_pre_pca)) {
  # Get the current variable
  variable <- df_pre_pca[, i]
  
  # Create a density plot for the variable with larger font size
  plot(density(variable[[1]]), main = "", xlab = "", ylab = "", cex.axis = font_size, cex.lab = font_size)
  
  # Add the title using the column name with larger font size
  title(main = colnames(df_pre_pca)[i], cex.main = font_size)
}

# Close the PNG device
dev.off()
png 
  2 

4 PCA Analysis

Code
# Perform PCA
pca_result <- prcomp(df_pre_pca)
Code
# Assuming pca_result is your prcomp object
pca_summary <- summary(pca_result)$importance |> round(digits = 2)



# Convert to tibble and remove row names
pca_summary_tibble <- tibble::rownames_to_column(as.data.frame(pca_summary), var = "Components")

# Create gt table
pca_summary_tibble %>%  
  gt() %>%
  tab_header(
    title = "Principal Component Analysis Summary",
    subtitle = "Importance of components"
  )
Principal Component Analysis Summary
Importance of components
Components PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48 PC49
Standard deviation 2.81 2.26 2.00 1.74 1.61 1.44 1.34 1.31 1.16 1.12 1.08 1.06 1.03 1.01 0.99 0.97 0.94 0.89 0.89 0.88 0.86 0.85 0.83 0.77 0.75 0.74 0.70 0.67 0.66 0.61 0.61 0.59 0.56 0.53 0.50 0.49 0.47 0.42 0.40 0.36 0.34 0.3 0.27 0.23 0.16 0.1 0.05 0.02 0
Proportion of Variance 0.16 0.10 0.08 0.06 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.0 0.00 0.00 0
Cumulative Proportion 0.16 0.27 0.35 0.41 0.46 0.51 0.54 0.58 0.61 0.63 0.65 0.68 0.70 0.72 0.74 0.76 0.78 0.79 0.81 0.83 0.84 0.86 0.87 0.88 0.89 0.91 0.92 0.93 0.93 0.94 0.95 0.96 0.96 0.97 0.97 0.98 0.98 0.99 0.99 0.99 0.99 1.0 1.00 1.00 1.00 1.0 1.00 1.00 1

5 Future Steps

5.1 Reducing Variables in PCA Analysis

By selecting variables based on the significance of loadings for the top principal components, we can focus on the ones that contribute most to the data’s variance. This reduction simplifies the analysis and emphasizes the most influential factors.

5.2 Understanding the Principal Components (PCs)

5.2.1 The Biplot

Biplots graphically represent the relationship between observations and variables, providing insight into the PCs.

Code
biplot(pca_result)

5.2.2 The Scree Plot

A scree plot visualizes the proportion of variance explained by each PC. The first few PCs are the main contributors, while the rest have minimal influence.

Code
# Visualize the PCA result using a scree plot to see the variance explained by each principal component
plot(pca_result, type = "l")

5.2.3 A Look at Loadings

Loadings in PCA reveal how original variables influence each Principal Component. High absolute values indicate strong influence, while the sign indicates the direction. Understanding these can uncover underlying themes, such as climate factors.

5.2.3.1 The importance of the factors for each principal component

Code
# Extract the loadings for the desired number of principal components
num_pc <- 13
loadings <- pca_result$rotation[, 1:num_pc]

# Function to plot ordered loadings
plot_ordered_loadings <- function(loadings, pc_num) {
  ordered_indices <- order(-abs(loadings[, pc_num]), decreasing = TRUE)
  
  # Determine the colours based on the sign of the values
  bar_colours <- ifelse(loadings[ordered_indices, pc_num] < 0, "red", "blue")
  
  barplot(abs(loadings[ordered_indices, pc_num]), horiz = TRUE,
          main = paste("Loadings for PC", pc_num),
          names.arg = rownames(loadings)[ordered_indices],
          las = 1, cex.names = 0.7, xlim = c(0, 0.5), col = bar_colours)
}

# Set up the plotting area with adjusted margins
par(mar = c(5, 25, 4, 2))

# Loop over each principal component and plot
for (i in 1:num_pc) {
  plot_ordered_loadings(loadings, i)
}

To identify the factors that are consistently least influential across the first three principal components, we can look at the absolute values of the loadings. Factors with small absolute loadings are less influential. Knowledge of the specific field aids in interpreting PCs. Principal Components can be complex and may resist straightforward interpretation.

5.3 Classifying Data with K-Means Clustering and Principal Components

5.3.1 Step 1: Selecting the Principal Components

Start by using the top three Principal Components from PCA, which capture the core variances and correlations in your data.

Code
# Extract the first three principal components
selected_components <- pca_result$x[, 1:13]

5.3.2 Step 2: Determine the Number of Clusters

Select the optimal number of clusters (k) using methods like the Elbow Method or Silhouette Analysis.

5.3.2.1 Elbow Method

The Elbow Method involves running k-means clustering for a range of \(k\) values and plotting the total within-cluster sum of squares. The “elbow” of the plot represents an optimal value for \(k\) (a balance between precision and computational cost).

Code
# Compute total within-cluster sum of squares for different k values
wss <- sapply(1:10, function(k) {
  kmeans(selected_components, centers = k)$tot.withinss
})

# Plot the total within-cluster sum of squares
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method")

based on the numbers, you might consider the point where the decrease starts to slow down, which could be around \(k\)= 4 or \(k\)=5.

5.3.2.2 Silhouette Analysis

Silhouette Analysis measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Code
library(cluster)

# Compute silhouette scores for different k values
silhouette_scores <- sapply(2:10, function(k) {
  cluster_result <- kmeans(selected_components, centers = k)
  silhouette_avg <- mean(silhouette(cluster_result$cluster, dist(selected_components))[, "sil_width"])
  silhouette_avg
})

# Plot the silhouette scores
plot(2:10, silhouette_scores, type = "b", xlab = "Number of Clusters", ylab = "Average Silhouette Width",
     main = "Silhouette Analysis")

The maximum silhouette score is 0.3428859, which corresponds to 5 clusters (since the first value represents \(k\)=2)

5.3.3 Step 3: Applying K-Means Algorithm

Utilize the k-means algorithm to divide the data into k clusters, focusing on the first three PCs. In R, this can be done with:

Code
# Choose the number of clusters
k <- 5

# Perform k-means clustering
kmeans_result <- kmeans(selected_components, centers = k)

5.3.4 Step 4: Interpret the Clusters

Investigate each cluster to understand the common traits within them. Interpretation requires a blend of data analysis and domain-specific knowledge.

Code
library(ggplot2)

# Create a data frame for plotting
plot_data <- data.frame(selected_components, cluster = as.factor(kmeans_result$cluster))

# Plot the first two principal components and color by cluster
ggplot(plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point() +
  labs(title = "K-means Clustering on First Two Principal Components") +
  theme_minimal()

An interactive 3D scatter plot below shows the first three principal components, colored by cluster. We can rotate the plot to view it from different angles, and you can hover over the points to see additional information.

Code
# Load the plotly package
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
Code
# Create a data frame for plotting
plot_data <- data.frame(selected_components, cluster = as.factor(kmeans_result$cluster))

# Create the 3D scatter plot
plot_3d <- plot_ly(data = plot_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~cluster, type = "scatter3d") 
# Show the plot
plot_3d
No scatter3d mode specifed:
  Setting the mode to markers
  Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

6 Exploratory data analysis

6.1 Grids of box plots for each parameters

Code
pca_eda <- tibble(df_pre_pca, cluster = as.factor(kmeans_result$cluster))

library(ggplot2)
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
Code
# Convert the data frame from a wide to a long format  
pca_eda_long <- pca_eda|> 
  tidyr::gather(key = "variable", value = "value", -cluster)

#Create a list of ggplot objects for each variable
plots <- lapply(unique(pca_eda_long$variable), function(var) {
  ggplot(pca_eda_long[pca_eda_long$variable == var,], aes(x = cluster, y = value)) +
    geom_boxplot() +
    labs(title = var, x = "Cluster", y = "Value") +
    theme_minimal()
})
grid_plot <- do.call(gridExtra::grid.arrange, c(plots, ncol = 5))

Code
ggsave("output/boxplots.png", grid_plot, width = 20, height = 15)

6.2 Visualise the clusters into a map

7 Read a map then attach

Code
library(terra)
terra 1.7.39

Attaching package: 'terra'
The following object is masked from 'package:naniar':

    shade
Code
library(sf)
Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
Code
# Read the shapefile
desa <- st_read("data/INDO_DESA_2019/INDO_DESA_2019.shp")
Reading layer `INDO_DESA_2019' from data source 
  `D:\OneDrive - CIFOR-ICRAF\Documents\GitHub\vulnerability-typology-map\data\INDO_DESA_2019\INDO_DESA_2019.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 84091 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 95.00971 ymin: -11.00766 xmax: 141.02 ymax: 6.076809
Geodetic CRS:  WGS 84
Code
# Filter based on the 'kdprov' attribute
desa_sulsel <- desa |> filter(kdprov %in% 73)
desa_sulsel <- desa_sulsel |> dplyr::select(-c("kddesa", "iddesa"))
desa_sulsel_id_kec <- desa_sulsel |> select(nmkab,nmkec,kdprov,kdkab,kdkec) |> 
  mutate(idkec_dum = paste0(kdprov,kdkab,kdkec, "a") )
rm(desa)

# add the cluster attribute to the desa_sulsel object
cluster_data<- scaled_df_complete_temp |> select(-c(nmkab, nmkec ,nmprov ,  sumber, kdprov,kdkab,kdkec)) |> 
  dplyr::select(1:9) |> bind_cols(tibble (cluster = kmeans_result$cluster))

clusters_sulsel<- desa_sulsel_id_kec |> left_join(cluster_data, by = "idkec_dum")

# Write the sf object to a shapefile
st_write(clusters_sulsel, "output/tipologi_5_kelas.shp", append = TRUE)
Warning in abbreviate_shapefile_names(obj): Field names abbreviated for ESRI
Shapefile driver
Updating layer `tipologi_5_kelas' to data source `output/tipologi_5_kelas.shp' using driver `ESRI Shapefile'
Updating existing layer tipologi_5_kelas
Writing 3053 features with 15 fields and geometry type Multi Polygon.
Code
# # Plot the SpatVector object, using the 'cluster' attribute for the fill color
# plot(desa_sulsel, "cluster", col=c("#E69F00", "#CC79A7", "#009E73", "#F0E442", "#0072B2"), lwd=0.1) # Assuming we have 5 clusters

library(leaflet)
# Create a color palette for the 'cluster' variable
pal <- colorFactor(palette = "Set1", domain = clusters_sulsel$cluster)

# Constructing the label string first
clusters_sulsel$label_content <- with(clusters_sulsel, 
                                      paste0("<strong>Kabupaten:</strong> ", nmkab, "<br>",
                                             "<strong>Kecamatan:</strong> ", nmkec, "<br>",
                                             "<strong>Cluster:</strong> ", cluster)) |> lapply(htmltools::HTML)

# Check the first few rows to ensure the label is being constructed correctly
head(clusters_sulsel$label_content)
[[1]]
<strong>Kabupaten:</strong> KEPULAUAN SELAYAR<br><strong>Kecamatan:</strong> BONTOHARU<br><strong>Cluster:</strong> 1

[[2]]
<strong>Kabupaten:</strong> SOPPENG<br><strong>Kecamatan:</strong> MARIO RIWAWO<br><strong>Cluster:</strong> 4

[[3]]
<strong>Kabupaten:</strong> KEPULAUAN SELAYAR<br><strong>Kecamatan:</strong> BONTOMANAI<br><strong>Cluster:</strong> 1

[[4]]
<strong>Kabupaten:</strong> KEPULAUAN SELAYAR<br><strong>Kecamatan:</strong> BONTOMANAI<br><strong>Cluster:</strong> 1

[[5]]
<strong>Kabupaten:</strong> KEPULAUAN SELAYAR<br><strong>Kecamatan:</strong> BUKI<br><strong>Cluster:</strong> 1

[[6]]
<strong>Kabupaten:</strong> BULUKUMBA<br><strong>Kecamatan:</strong> GANTARANG<br><strong>Cluster:</strong> 4
Code
# Create the leaflet map with HTML-rendered labels
leaflet(clusters_sulsel) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addPolygons(
    fillColor = ~pal(cluster),
    weight = 0.5,
    opacity = 1,
    color = "white",
    dashArray = "3",
    fillOpacity = 0.7,
    highlight = highlightOptions(
      weight = 5,
      color = "#666",
      dashArray = "",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = ~label_content,
    labelOptions = labelOptions(
      noHide = FALSE,
      direction = 'auto'
    )
  ) %>%
  addLegend(pal = pal, values = ~cluster, title = "Cluster")

8 Caveats

  • Lack of Data Understanding: The analysis was conducted without in-depth knowledge of the data, so the results should be interpreted cautiously.

  • Skewed Distribution: Many variables have a right-skewed distribution, which may violate the linearity assumption if not properly transformed.

  • No one-size-fits-all: Data pre-processing should be tailored to the specific dataset and analysis objectives. This includes careful consideration of data transformation and variable selection, based on both statistical properties and subject-matter knowledge.